<!DOCTYPE html>
<html lang="en">

<head>
  <meta charset="utf-8">
  <meta content="width=device-width, initial-scale=1.0" name="viewport">

  <title>Learning CNN on ViT: A Hybrid Model to Explicitly Class-specific Boundaries for Domain Adaptation | Project Page</title>
  <meta name="description" content="Project page for 'Learning CNN on ViT: A Hybrid Model to Explicitly Class-specific Boundaries for Domain Adaptation'">
  <meta name="author" content="Nhat-Tuong Do-Tran, Tuan-Ngoc Nguyen">
  
  <!-- Google Fonts -->
  <link href="https://fonts.googleapis.com/css?family=Open+Sans:300,300i,400,400i,600,600i,700,700i|Raleway:300,300i,400,400i,500,500i,600,600i,700,700i|Poppins:300,300i,400,400i,500,500i,600,600i,700,700i" rel="stylesheet">

  <!-- Vendor CSS Files -->
  <link rel="stylesheet"
  href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
  <link rel="stylesheet" href="./static/css/bulma.min.css">
  <link href="./static/bootstrap/css/bootstrap.min.css" rel="stylesheet">
  <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0/css/fontawesome.min.css" >
  
  <!-- Publication Files -->
  <link href="./static/css/publication_style.css" rel="stylesheet">
  
  <script defer src="./static/js/fontawesome.all.min.js"></script>

</head>

<body>
  <main id="main">
    <section class="publication">
      <div class="container">
        <div class="publication-title">          
          <h5>CVPR 2024</h5>
          <h1 class="title is-1" style="color:#0047ab;">Learning CNN on ViT: </h1>
          <h2>A Hybrid Model to Explicitly Class-specific Boundaries for Domain Adaptation</h2>
          <h4>   
            <a href="https://scholar.google.com/citations?user=_9iFmuYAAAAJ" target="_blank" data-bs-toggle="tooltip" data-bs-placement="bottom" data-bs-html="true" title="<b>Co-first Author</b> <br> Go to Google Scholar">Ba Hung Ngo<sup>1,<span class="co-first">*</span></sup></a>,
            <a href="https://scholar.google.com/citations?user=UUT7AlUAAAAJ" target="_blank" data-bs-toggle="tooltip" data-bs-placement="bottom" data-bs-html="true" title="<b>Co-first Author</b> <br> Go to Google Scholar">Nhat-Tuong Do-Tran<sup>2,<span class="co-first">*</span></sup></a>,
            <a href="https://scholar.google.com/citations?user=la0uCvsAAAAJ" target="_blank" data-bs-toggle="tooltip" data-bs-placement="bottom" data-bs-html="true" title="<b>Co-Author</b> <br> Go to Google Scholar">Tuan-Ngoc Nguyen<sup>3</sup></a>,
            <a href="https://scholar.google.com/citations?user=Ei00xroAAAAJ" target="_blank" data-bs-toggle="tooltip" data-bs-placement="bottom" data-bs-html="true" title="<b>Co-Author</b> <br> Go to Google Scholar">Hae-Gon Jeon<sup>4</sup></a>,
            Tae Jong Choi<sup>1,<span class="co-author">†</span></sup>
          </h4>
          <h6>
            <sup>1</sup>Graduate School of Data Science, Chonnam National University, South Korea <br /> 
            <sup>2</sup>Department of Computer Science, National Yang Ming Chiao Tung University, Taiwan <br />
            <sup>3</sup>Digital Transformation Center, FPT Telecom, Vietnam, <sup>4</sup>AI Graduate School, GIST, South Korea <br />
          </h6>
          
          <div class="column has-text-centered">
            <div class="publication-links">
               <!-- PDF Link. -->
               <span class="link-block">
                <a href="https://cvpr.thecvf.com/Conferences/2024/AcceptedPapers"
                    class="external-link button is-normal is-rounded is-dark">
                  <span class="icon">
                    <i class="fas fa-file-pdf"></i>
                  </span>
                  <span>Paper (PDF)</span>
                </a>
               </span>

                <!-- arXiv Link. -->
                <span class="link-block">
                  <a href="https://arxiv.org/abs/2403.18360"
                      class="external-link button is-normal is-rounded is-dark">
                    <span class="icon">
                      <i class="ai ai-arxiv"></i>
                    </span>
                    <span>arXiv</span>
                  </a>
                </span>

                <!-- Video Link. -->
                <span class="link-block">
                  <a href="https://www.youtube.com/"
                      class="external-link button is-normal is-rounded is-dark">
                    <span class="icon">
                      <i class="fab fa-youtube"></i>
                    </span>
                    <span>Video (YouTube)</span>
                  </a>
                </span>

                <!-- Poster Link. -->
                <span class="link-block">
                  <a href="https://raw.githubusercontent.com/dotrannhattuong/ECB/main/images/poster_cvpr2024.png"
                     class="external-link button is-normal is-rounded is-dark">
                    <span class="icon">
                        <i class="far fa-images"></i>
                    </span>
                    <span>Poster</span>
                  </a>
                </span>

                <!-- Code Link. -->
                <span class="link-block">
                  <a href="https://github.com/dotrannhattuong/ECB"
                     class="external-link button is-normal is-rounded is-dark">
                    <span class="icon">
                        <i class="fab fa-github"></i>
                    </span>
                    <span>Code</span>
                  </a>
                </span>
            </div>
         </div>

        </div>
      </div>

      <div class="container">
        <div class="publication-detail">
          <h3>Abstract</h3>
          <p>Most domain adaptation (DA) methods are based on either convolutional neural networks (CNNs) or vision transformers (ViTs). They use these architectures as encoders to align the distribution differences between domains, without considering their distinct characteristics. For instance, ViT excels in accuracy due to its superior ability to capture global representations, while CNN has an advantage in capturing local representations. 
            This fact has led us to design a hybrid method that fully exploits the advantages of both ViT and CNN, called <b>E</b>xplicitly <b>C</b>lass-specific <b>B</b>oundaries (<b>ECB</b>). 
            ECB learns CNN on ViT to combine their distinct strengths. In particular, we leverage the properties of ViT to explicitly find class-specific decision boundaries by maximizing the discrepancy between the outputs of the two classifiers, thereby detecting target samples far from the source support. 
            In contrast, the CNN encoder clusters target features within the previously defined class-specific boundaries by minimizing the discrepancy between the probabilities of the two classifiers. Finally, ViT and CNN mutually exchange knowledge to improve the quality of pseudo labels and reduce the knowledge discrepancies between the two models.
            Compared to conventional DA methods, our ECB achieves superior performance, which verifies the effectiveness of this hybrid design.</p>
          <br><br>

          <h3>Method</h3>
          <img src="./imgs/method.svg" alt="Image" style='height: 100%; width: 100%; max-width: 1000px; object-fit: contain'>          
          <p>The overall framework of the proposed Finding-to-Conquering (FTC) strategy. We use ViT to build the encoder E<sub>1</sub>, which drives two classifiers F<sub>1</sub> and F<sub>2</sub> to expand class-specific boundaries comprehensively. In addition, we use a CNN as the second encoder E<sub>2</sub> to cluster target features within the boundaries identified by ViT. Both encoders share the two classifiers F<sub>1</sub> and F<sub>2</sub>.</p>          
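          <p>The find-and-conquer objective above can be sketched with a toy discrepancy computation: the finding step updates the classifiers to <i>maximize</i> the L1 distance between their probability outputs on target samples, while the conquering step updates the CNN encoder to <i>minimize</i> it. This is a minimal NumPy illustration of the discrepancy measure only, not the authors' implementation; the logits, batch size, and class count are made up.</p>

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def discrepancy(p1, p2):
    # Mean L1 distance between the two classifiers' probabilities.
    return np.abs(p1 - p2).mean()

# Hypothetical logits from classifiers F1 and F2 on a batch of 4 target samples.
rng = np.random.default_rng(0)
logits_f1 = rng.normal(size=(4, 10))
logits_f2 = rng.normal(size=(4, 10))

d = discrepancy(softmax(logits_f1), softmax(logits_f2))
# Finding step: train F1/F2 (on the ViT encoder E1) to maximize d,
# pushing class-specific boundaries toward ambiguous target samples.
# Conquering step: train the CNN encoder E2 to minimize d,
# clustering target features inside the boundaries found above.
```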
          <br><br>

          <h3>Results on the SSDA Setting</h3>
          <img src="./imgs/result_ssda.png" alt="Image" style='height: 100%; width: 100%; max-width: 1000px; object-fit: contain'>          
          <p>The CNN branch of our <b>ECB</b> method outperforms all prior methods. Compared to the nearest competitor, G-ABC, ECB (CNN) achieves an impressive maximum performance increase of <b>+9.3%</b> in the <i>skt→pnt</i> task for 3-shot learning. Even in the more restrictive 1-shot setting, the ECB method demonstrates robust performance, showing an increase of <b>+3.1%</b> in the <i>rel→clp</i> task. On average, the ECB method delivers a performance improvement of <b>+6.6%</b> in the 1-shot setting and <b>+7.1%</b> in the 3-shot setting.</p>
          <br><br>

          <h3>Results on the UDA Setting</h3>
          <img src="./imgs/result_uda.png" alt="Image" style='height: 100%; width: 100%; max-width: 1000px; object-fit: contain'>          
          <p>Our approach records accuracy gains of <b>+7.7%,</b> <b>+8.1%,</b> and <b>+7.2%</b> for the <i>C→A</i>, <i>C→R</i>, and <i>P→A</i> tasks, respectively, surpassing the second-best method. In addition, our method achieves an impressive average classification accuracy of 81.2%, a remarkable margin of <b>+5.4%</b> over the nearest competitor, EIDCo.</p>
          <br><br>

          <h3>t-SNE Visualization</h3>
          <img src="./imgs/tsne.png" alt="Image" style='height: 100%; width: 100%; max-width: 1000px; object-fit: contain'>   
          <p>We visualize feature spaces for the <i>rel→skt</i> task on DomainNet in the 3-shot setting using t-SNE. Figures (a) and (b) illustrate the features obtained by the CNN and ViT branches before adaptation. Figures (c) and (d) show how the distribution of the CNN branch changes depending on whether the FTC strategy is used when implementing our ECB method.</p>
          <br><br>

          <h3>Grad-CAM Visualization</h3>

          <table border="1">
            <!-- BIRD -->
            <tr>
              <th>Bird</th>
              <th>CNN (Before)</th>
              <th>ViT (Before)</th>
              <th>CNN (After)</th>
              <th>ViT (After)</th>
            </tr>
            <tr>
              <td><img src="./imgs/bird_Origin.svg" alt="Bird Origin"></td>
              <td><img src="./imgs/bird_CNN_warmup.svg" alt="Bird CNN Warmup"></td>
              <td><img src="./imgs/bird_VIT_warmup.svg" alt="Bird VIT Warmup"></td>
              <td><img src="./imgs/bird_CNN_adapt.svg" alt="Bird CNN ECB"></td>
              <td><img src="./imgs/bird_VIT_adapt.svg" alt="Bird VIT ECB"></td>
            </tr>
            
            <!-- CANNON -->
            <tr>
              <th>Cannon</th>
              <th>CNN (Before)</th>
              <th>ViT (Before)</th>
              <th>CNN (After)</th>
              <th>ViT (After)</th>
            </tr>
            <tr>
              <td><img src="./imgs/cannon_Origin.svg" alt="Cannon Origin"></td>
              <td><img src="./imgs/cannon_CNN_warmup.svg" alt="Cannon CNN Warmup"></td>
              <td><img src="./imgs/cannon_VIT_warmup.svg" alt="Cannon VIT Warmup"></td>
              <td><img src="./imgs/cannon_CNN_adapt.svg" alt="Cannon CNN ECB"></td>
              <td><img src="./imgs/cannon_VIT_adapt.svg" alt="Cannon VIT ECB"></td>
            </tr>

            <!-- CACTUS -->
            <tr>
              <th>Cactus</th>
              <th>CNN (Before)</th>
              <th>ViT (Before)</th>
              <th>CNN (After)</th>
              <th>ViT (After)</th>
            </tr>
            <tr>
              <td><img src="./imgs/cactus_Origin.svg" alt="Cactus Origin"></td>
              <td><img src="./imgs/cactus_CNN_warmup.svg" alt="Cactus CNN Warmup"></td>
              <td><img src="./imgs/cactus_VIT_warmup.svg" alt="Cactus VIT Warmup"></td>
              <td><img src="./imgs/cactus_CNN_adapt.svg" alt="Cactus CNN ECB"></td>
              <td><img src="./imgs/cactus_VIT_adapt.svg" alt="Cactus VIT ECB"></td>
            </tr>

            <!-- PEANUT -->
            <tr>
              <th>Peanut</th>
              <th>CNN (Before)</th>
              <th>ViT (Before)</th>
              <th>CNN (After)</th>
              <th>ViT (After)</th>
            </tr>
            <tr>
              <td><img src="./imgs/peanut_Origin.svg" alt="Peanut Origin"></td>
              <td><img src="./imgs/peanut_CNN_warmup.svg" alt="Peanut CNN Warmup"></td>
              <td><img src="./imgs/peanut_VIT_warmup.svg" alt="Peanut VIT Warmup"></td>
              <td><img src="./imgs/peanut_CNN_adapt.svg" alt="Peanut CNN ECB"></td>
              <td><img src="./imgs/peanut_VIT_adapt.svg" alt="Peanut VIT ECB"></td>
            </tr>

            <!-- CANOE -->
            <tr>
              <th>Canoe</th>
              <th>CNN (Before)</th>
              <th>ViT (Before)</th>
              <th>CNN (After)</th>
              <th>ViT (After)</th>
            </tr>
            <tr>
              <td><img src="./imgs/canoe_Origin.svg" alt="Canoe Origin"></td>
              <td><img src="./imgs/canoe_CNN_warmup.svg" alt="Canoe CNN Warmup"></td>
              <td><img src="./imgs/canoe_VIT_warmup.svg" alt="Canoe VIT Warmup"></td>
              <td><img src="./imgs/canoe_CNN_adapt.svg" alt="Canoe CNN ECB"></td>
              <td><img src="./imgs/canoe_VIT_adapt.svg" alt="Canoe VIT ECB"></td>
            </tr>

            <!-- ANT -->
            <tr>
              <th>Ant</th>
              <th>CNN (Before)</th>
              <th>ViT (Before)</th>
              <th>CNN (After)</th>
              <th>ViT (After)</th>
            </tr>
            <tr>
              <td><img src="./imgs/ant_Origin.svg" alt="Ant Origin"></td>
              <td><img src="./imgs/ant_CNN_warmup.svg" alt="Ant CNN Warmup"></td>
              <td><img src="./imgs/ant_VIT_warmup.svg" alt="Ant VIT Warmup"></td>
              <td><img src="./imgs/ant_CNN_adapt.svg" alt="Ant CNN ECB"></td>
              <td><img src="./imgs/ant_VIT_adapt.svg" alt="Ant VIT ECB"></td>
            </tr>
          </table>


          <br><br>
          <h3>BibTeX</h3>
          <pre>@inproceedings{ECB,
  title={Learning CNN on ViT: A Hybrid Model to Explicitly Class-specific Boundaries for Domain Adaptation},
  author={Ba Hung Ngo and Nhat-Tuong Do-Tran and Tuan-Ngoc Nguyen and Hae-Gon Jeon and Tae Jong Choi},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}</pre>
        </div>
      </div>
    </section>
  </main>

  <!-- Vendor JS Files -->
  <script src="./static/bootstrap/js/bootstrap.bundle.min.js"></script>

  <!-- Main JS File -->
  <script type="text/javascript">
    var tooltipTriggerList = [].slice.call(document.querySelectorAll('[data-bs-toggle="tooltip"]'))
    var tooltipList = tooltipTriggerList.map(function (tooltipTriggerEl) {
      return new bootstrap.Tooltip(tooltipTriggerEl)
    })
    var popoverTriggerList = [].slice.call(document.querySelectorAll('[data-bs-toggle="popover"]'))
    var popoverList = popoverTriggerList.map(function (popoverTriggerEl) {
      return new bootstrap.Popover(popoverTriggerEl)
    })
  </script>

</body>
</html>

